You will use dplyr::filter() to keep or drop rows
from a dataframe.
You will filter rows by specifying conditions on numbers or
strings using relational operators like greater than
(>), less than (<), equal to
(==), and not equal to (!=).
You will filter rows by combining conditions using logical
operators like the ampersand (&) and the vertical bar
(|).
You will filter rows by negating conditions using the exclamation
mark (!) logical operator.
You will filter rows with missing values using the
is.na() function.
In this lesson, we will again use the data from the COVID-19 serological survey conducted in Yaounde, Cameroon.
yaounde <- read_csv(here::here('ch04_data_wrangling/data/yaounde_data.csv'))
## a smaller subset of variables
yao <- yaounde %>%
select(age, sex, weight_kg, highest_education, neighborhood,
occupation, is_smoker, is_pregnant,
igg_result, igm_result)
yaoSee the previous lesson for more information about this data.
filter()We use filter() to keep rows that satisfy a set of
conditions. Let’s take a look at a simple example. If we want to keep
just the male records, we run:
yao %>% filter(sex == "Male")Note the use of the double equal sign == rather than the
single equal sign =. The == sign tests for
equality, as demonstrated below:
## create the object `sex_vector` with three elements
sex_vector <- c("Male", "Female", "Female")
## test which elements are equal to "Male"
sex_vector == "Male"## [1] TRUE FALSE FALSE
So the code yao %>% filter(sex == "Male") will keep
all rows where the equality test sex == "Male" evaluates to
TRUE.
It is often useful to chain filter() with
nrow() to get the number of rows fulfilling a
condition.
## how many respondents were male?
yao %>%
filter(sex == "Male") %>%
nrow()## [1] 422
The double equal sign, ==, tests for equality, while the
single equals sign, =, is used for specifying values to
arguments inside functions.
Filter the yao data frame to respondents who were
pregnant during the survey. Store in q1.
How many respondents were female? (Use filter() and
nrow())
The == operator introduced above is an example of a
“relational” operator, as it tests the relation between two values. Here
is a list of some of these operators:
| Operator | is TRUE if |
| A < B | A is less than B |
| A <= B | A is less than or equal to B |
| A > B | A is greater than B |
| A >= B | A is greater than or equal to B |
| A == B | A is equal to B |
| A != B | A is not equal to B |
| A %in% B | A is an element of B |
Let’s see how to use these within filter():
yao %>% filter(sex != "Male") ## keep rows where `sex` is not "Male"yao %>% filter(age < 6) ## keep respondents under 6yao %>% filter(age >= 70) ## keep respondents aged at least 70## keep respondents whose highest education is "Primary" or "Secondary"
yao %>% filter(highest_education %in% c("Primary", "Secondary"))From yao, keep only respondents who were children (under
18).
With %in%, keep only respondents who live in the
“Tsinga” or “Messa” neighborhoods.
& and |We can pass multiple conditions to a single filter()
statement separated by commas:
## keep respondents who are pregnant and are ex-smokers
yao %>% filter(is_pregnant == "Yes", is_smoker == "Ex-smoker") ## only one rowWhen multiple conditions are separated by a comma, they are
implicitly combined with an and
(&).
It is best to replace the comma with & to make this
more explicit.
## same result as before, but `&` is more explicit
yao %>% filter(is_pregnant == "Yes" & is_smoker == "Ex-smoker")If we want to combine conditions with an or, we use
the vertical bar symbol, |.
## respondents who are pregnant OR who are ex-smokers
yao %>% filter(is_pregnant == "Yes" | is_smoker == "Ex-smoker")Filter yao to only keep men who tested IgG positive.
Filter yao to include just children (under 18s) and
those whose highest education is primary school.
!To negate conditions, we wrap them in !().
Below, we drop respondents who are children (less than 18 years) or who weigh less than 30kg:
## drop respondents < 18 years OR < 30 kg
yao %>% filter(!(age < 18 | weight_kg < 30))The ! operator is also used to negate %in%
since R does not have an operator for NOT in.
## drop respondents whose highest education is NOT "Primary" or "Secondary"
yao %>% filter(!(highest_education %in% c("Primary", "Secondary")))It is easier to read filter() statements as
keep statements, to avoid confusion over whether we are
filtering in or filtering out!
So the code below would read: “keep respondents who are under 18 or who weigh less than 30kg”.
yao %>% filter(age < 18 | weight_kg < 30)And when we wrap conditions in !(), we can then read
filter() statements as drop
statements.
So the code below would read: “drop respondents who are under 18 or who weigh less than 30kg”.
yao %>% filter(!(age < 18 | weight_kg < 30))Drop respondents who live outside of the Tsinga or Messa neighborhoods.
NA
valuesThe relational operators introduced so far do not work with
NA.
Let’s make a data subset to illustrate this.
yao_mini <- yao %>%
select(sex, is_pregnant) %>%
slice(1,11,50,2) ## custom row order
yao_miniIn yao_mini, the last respondent has an NA
for the is_pregnant column, because he is male.
Trying to select this row using == NA will not work.
yao_mini %>% filter(is_pregnant == NA) ## does not workyao_mini %>% filter(is_pregnant == "NA") ## does not workThis is because NA is a non-existent value. So R cannot
evaluate whether it is “equal to” or “not equal to” anything.
The special function is.na() is therefore necessary:
## keep rows where `is_pregnant` is NA
yao_mini %>% filter(is.na(is_pregnant)) This function can be negated with !:
## drop rows where `is_pregnant` is NA
yao_mini %>% filter(!is.na(is_pregnant))For tibbles, RStudio will highlight NA values bright red
to distinguish them from other values:
Keep all the responders who had missing records for the report of their smoking status
NAHandling NAs improperly is often a source of error.
Imagine, for example, that we we wanted to drop pregnant women from the
dataset.
We might write filter(is_pregnant != "Yes") (to be read
as “keep respondents who are not pregnant”), but this would be
wrong!
## keep rows where `is_pregnant` is not "Yes"
yao_mini %>% filter(is_pregnant != "Yes") ## badDo you see what went wrong? We wanted to drop the pregnant woman, but we dropped the man too!
This is because filter() drops all rows where the test
evaluates to NA. And the test NA != "Yes"
evaluates to NA, because R does not not whether
NA is equal to or not equal to “Yes”.
NA != "Yes" ## R does not know whether `NA` is equal to or not equal to "Yes"## [1] NA
In order to correctly filter on a column that contains
NAs, we often need to include an is.na()
condition. So to drop pregnant women without accidentally dropping the
man, we could write:
## keep rows where `is_pregnant` is not "Yes"
yao_mini %>% filter(is_pregnant != "Yes" |
is.na(is_pregnant)) ## OR `is_pregnant` is NAAlternatively, we could write:
## keep rows where `is_pregnant` is `NA`, "No" or "No response"
yao_mini %>%
filter(is.na(is_pregnant) | is_pregnant == "No" | is_pregnant == "No response"){width=“300”,height=“1000”}
For some respondents the respiration rate, in breaths per minute, was
recorded in the respiration_frequency column. From
yaounde, drop those with a respiration frequency under
20.
row_number()We sometimes need to filter based on row numbers. The dplyr helper
row_number() can help achieve this.
To keep rows 7 to 10 and row 70 we write:
yao %>% filter(row_number() %in% c(7:10, 70))To drop rows 7 to 10 we write:
yao %>% filter(!row_number() %in% 7:10)From yao, keep rows 8 to 20 and row 80.
stringr::str_detect()The yaounde dataset has a number of multiple response
variables.
The occupation variable, for example, lists multiple
responses separated by “--”:
## the fifth respondent is a trader and a farmer
yao %>% select(occupation) Such variables require special handling.
If we want to subset the data to farmers, we should not write
filter(occupation == "Farmer"), as that would drop anyone
who has a second or third occupation (e.g. “Trader–Farmer”).
The str_detect() function is useful here. This function
checks a given string (first argument) for a provided pattern (second
argument).
test_string <- "Trader--Farmer"
## returns TRUE since the test string contains "Farmer"
str_detect(string = test_string, pattern = "Farmer")## [1] TRUE
## returns FALSE since the test string does not contain "Student"
str_detect(test_string, "Student")## [1] FALSE
We can use this within filter():
yao %>%
select(occupation) %>%
## keep farmers
filter(str_detect(occupation, "Farmer"))From yao, keep respondents who are students.
The following team members contributed to this lesson:
Some material in this lesson was adapted from the following sources:
Horst, A. (2021). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original work published 2020)
Subset rows using column values—Filter. (n.d.). Retrieved 12 January 2022, from https://dplyr.tidyverse.org/reference/filter.html
Artwork was adapted from: